Abstract 



Background: Modeling is the bottleneck to successful implementation of knowledge management systems. In 
this paper, we propose an evolutionary approach to modeling based upon word processing documents and we 
describe the tool Phoenix providing the technical infrastructure. 

Methods: We applied our approach and software system to the authoring of medical case-based training systems. 
So far, authors have had to either hand-code the content (usually as HTML) or use highly sophisticated 
authoring systems, which require instruction and experience to master. With our approach, 
we carry further the ideas Felciano and Dev put into practice in their system Short Rounds [4]. They only 
presented pre-existing documents as an electronic patient record. Following our approach of evolutionary 
modeling, authors annotate documents to build full-fledged diagnostic training cases [5]. 

Results: For our training environment d3web.Train 1 [6,7], we developed a tool to extract case knowledge from 
existing documents, usually dismissal records, extending Phoenix to d3web.CaseImporter [8]. Independent 
authors used this tool to develop training systems, e.g. in rheumatology, gastroenterology, and cytology, 
observing a significant decrease in settling-in time (from several months down to 1 hour) and a decrease in the 
time necessary for developing a case (down to 4-6 hours) [9]. 

Conclusions: This paper describes the general approach and provides an in-depth analysis of the document 
parsing engine (Phoenix) 2 . To generalize the success of d3web.CaseImporter, we conclude by sketching further 
existing applications of Phoenix, including a method to populate the expert system d3web 3 / Assist 4 and 
extensions still to come (e.g. for populating the Semantic Web [10]). 

1 http://www.d3webtrain.de 
2 Phoenix is available under LGPL open source license from https://sourceforge.net/projects/phoenix-ie/ 



1 Motivation 

Phoenix is a rule-based extraction engine to transform XML documents to arbitrary output formats. Our 
goal in developing Phoenix was to "compile" medical training cases, particularly for d3web.Train, from 
Word or Open Office documents. 

By building upon well-known tools and by re-using existing documents, we seek to reduce authors' efforts 
both for learning and actual modeling. 

Following an evolutionary approach, authors alter and enhance existing content (dismissal records) 
step by step to model the desired content (training cases) [8,11]. For example, an author anonymizes 
observations (names, locations, dates), re-formats the document to enable automatic parsing, adds an 
introduction and a conclusion, formulates questions and provides feedback knowledge. 

As experience shows, we succeeded in our goals to reduce learning time and authoring effort [9]. This led to 
an increasing popularity among authors, since they can easily provide case-based supplements to lectures. 
However, Phoenix is by design a general-purpose tool and is also used, e.g., for populating XML content 
management systems and knowledge-based systems 5 . 

First, we introduce Phoenix in a high-level overview, going into details in the following sections. The next 
section analyzes the Phoenix information extraction algorithm, followed by an in-depth look at the 
extension mechanisms necessary to process arbitrary content and store the information in the desired 
format. In addition, we show how to store the extracted information back to XML. After that, we 
describe syntax and semantics of the Phoenix grammar. The paper closes with a look at existing Phoenix 
applications apart from case authoring and an outlook on features to come. 




3 http://www.d3web.de 
4 http://www.knowit-software.de 
5 E.g. for Assist, http://www.knowit-software.de 

2 Description 

Phoenix is a Java-based engine that is extended in order to match concrete requirements. Figure 1 shows the 
architecture of Phoenix. As a rule-based extraction engine, Phoenix is initialized from a rule set. A user 
object holds the information extracted and, together with the document, is input to Phoenix. To be 
applicable to different domains, Phoenix provides two extension mechanisms: selectors to read information 
from the document and actions to process and store the information to the user object. 

Figure 1: Phoenix architecture. 



Phoenix processes arbitrary XML documents as input data. However, Phoenix' main purpose is to process 
documents in Star Office/Open Office or MS Word format, e.g. discharge letters used for case authoring. 
Star Office/Open Office .sxw documents essentially are zipped XML documents, so Phoenix accesses these 
very easily. Phoenix is to natively support the Open Document Standard in future releases. 6 Microsoft 
Word documents are converted to .sxw automatically, using Open Office as a conversion server. Phoenix is 
also able to process HTML documents by using JTidy 7 as XML parser. 
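
To illustrate why zipped XML documents are easy to access, the following minimal sketch reads the content.xml entry of an .sxw archive into a DOM tree using only the Java standard library. The class SxwReader and its method are hypothetical and not part of the Phoenix API; Phoenix's own utility methods may work differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper for illustration only; not a Phoenix class. */
final class SxwReader {

    /** Returns the DOM tree of the "content.xml" entry of a zipped office document. */
    static Document readContent(InputStream sxw)
            throws IOException, ParserConfigurationException, SAXException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // prefixes such as text: and office: must resolve
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Do not try to load external DTDs referenced by the document.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        try (ZipInputStream zip = new ZipInputStream(sxw)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("content.xml".equals(e.getName())) {
                    return builder.parse(zip); // parses the current zip entry only
                }
            }
        }
        throw new IOException("no content.xml entry found");
    }
}
```

The resulting org.w3c.dom.Document can then serve as the input node for further processing.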

Phoenix processes XML input documents as DOM 8 trees rather than as a character input stream. However, 
processing does not work on single DOM tree nodes, but on blocks - node collections specified by the rule 
set. A block thus is a document fragment with all children matching certain criteria. Each rule set defines 
one or more block types, each specified by an XPath expression, a starting condition (based upon selectors) 
and a grouping expression. Each block type defines a set of rules applied to all blocks of this type identified 
in the document. These rules fire on a block if the rule's condition (again based on selectors) is met by the 
block's content. Upon firing, the rules activate an action or recursively start another rule set on that block. 

We decided against XML documents as the primary output. Instead, actions may alter the user object 
depending on the block's content. First, this API-level access to arbitrary user objects enables the use of 
pre-existing libraries for knowledge representation, providing consistency checks, encapsulation, and 
individual persistence. Second, the more general concept of actions provides more powerful processing: 
direct generation of XML makes it hard to re-structure information once written based upon information 
parsed later on. However, XML output is supported as a feature (see below) to provide an easy-to-use 
transformation output format. 

6 OpenOffice document format is basis for the Open Document Format for Office Applications, an OASIS standard supported 
by StarOffice, OpenOffice, KWord and hopefully by future versions of Microsoft Word. 
7 http://jtidy.sourceforge.net 
8 DOM: Document Object Model, http://www.w3.org/DOM/ 

After Phoenix has finished parsing a document, the user object is set up with all the information from the 
document. One can now use this object, e.g. by storing the data to a persistent representation. 

3 Algorithm 

Phoenix starts processing based upon an org.w3c.dom.Node representing the input document and a 
java.lang.Object as user object. Utility methods provide transparent access to .sxw, .doc, and .html 
documents. 

Basis for the parsing process is a rule set, providing a set of block type definitions. Each block type is 
specified by an XPath expression, a starting condition and a grouping expression. First, based upon these 
criteria, minimal blocks are generated by Phoenix: For each block type, document nodes matching the 
XPath expression are evaluated against the corresponding starting condition. If this starting condition is 
met, a minimal block is created. 
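
The first phase can be sketched with the standard javax.xml.xpath API. The sketch below is an illustration under assumptions, not the Phoenix implementation: the class MinimalBlocks is hypothetical, and the starting condition is modeled as a plain java.util.function.Predicate on the candidate node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical illustration of phase one: finding the starting nodes of minimal blocks. */
final class MinimalBlocks {

    /**
     * Evaluates the block type's XPath expression and keeps those candidate
     * nodes for which the starting condition holds.
     */
    static List<Node> find(Node document, String xpath, Predicate<Node> startCondition)
            throws XPathExpressionException {
        NodeList candidates = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, document, XPathConstants.NODESET);
        List<Node> starts = new ArrayList<>();
        for (int i = 0; i < candidates.getLength(); i++) {
            Node candidate = candidates.item(i);
            if (startCondition.test(candidate)) {
                starts.add(candidate); // one minimal block per accepted node
            }
        }
        return starts;
    }
}
```

For example, with the expression //p and a condition that accepts paragraphs ending in a colon, every heading-like paragraph would start its own minimal block.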

A condition is either a terminal condition (one of Exists, IntEquals, TextEquals, TextContains, 
TextStartsWith, TextEndsWith, TextMatches, or ParagraphStart 9 ) or a non-terminal condition (one of 
and, or, not, or min-max). Each terminal condition is configured with a selector (see 4.1); some require 
comparison values, e.g. TextEquals. A condition is checked against a block and returns a Boolean value. 
This return value is true if and only if the condition matches the content returned by the selector. 
Conditions are not only used as starting conditions, but also as grouping or end conditions, and as action 
conditions (see below). 
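
The composition of terminal and non-terminal conditions can be sketched as follows. This is a simplified illustration, not the Phoenix API: the interface Condition is hypothetical, selectors are reduced to functions returning text, and the semantics chosen for min-max (at least min and at most max nested conditions hold) is an assumption.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical condition type: checked against a block B, yields a Boolean value. */
interface Condition<B> {
    boolean test(B block);

    // Terminal conditions: each is configured with a selector (here: a text-returning function).
    static <B> Condition<B> exists(Function<B, ?> selector) {
        return b -> selector.apply(b) != null;
    }

    static <B> Condition<B> textEquals(Function<B, String> selector, String value) {
        return b -> value.equals(selector.apply(b));
    }

    static <B> Condition<B> textContains(Function<B, String> selector, String value) {
        return b -> {
            String text = selector.apply(b);
            return text != null && text.contains(value);
        };
    }

    // Non-terminal conditions: aggregate nested conditions.
    static <B> Condition<B> not(Condition<B> c) {
        return b -> !c.test(b);
    }

    static <B> Condition<B> and(List<Condition<B>> cs) {
        return b -> cs.stream().allMatch(c -> c.test(b));
    }

    static <B> Condition<B> or(List<Condition<B>> cs) {
        return b -> cs.stream().anyMatch(c -> c.test(b));
    }

    /** Assumed semantics: true if between min and max (inclusive) nested conditions hold. */
    static <B> Condition<B> minMax(int min, int max, List<Condition<B>> cs) {
        return b -> {
            long hits = cs.stream().filter(c -> c.test(b)).count();
            return hits >= min && hits <= max;
        };
    }
}
```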

In a second phase, the minimal blocks are expanded according to the corresponding block type's grouping 
expression. This may be one of NONE, GROUPING_EXPRESSION, END_EXPRESSION, and NEXT_BLOCK. Both 
GROUPING_EXPRESSION and END_EXPRESSION require an XPath expression. 

9 The list of terminal conditions matches the current needs. It can be extended easily to reflect future requirements. 

A NONE-grouping does not expand the minimal block. These blocks thus contain a single DOM node. A 
GROUPING_EXPRESSION adds the subsequent siblings of the block's starting node as long as they match the 
XPath specified. Conversely, END_EXPRESSION adds all siblings as long as they do not match the XPath provided. 
NEXT_BLOCK as grouping type expands the block up to the beginning of the next block - this grouping type 
is the reason for the two-phase block construction process. 
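
The four grouping types can be sketched on a plain list of siblings. The sketch below is a simplified illustration, not Phoenix code: the class Grouping is hypothetical, the XPath match is abstracted into a Predicate, and it assumes that the sibling terminating an END_EXPRESSION block is not part of the block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical illustration of phase two: expanding a minimal block. */
final class Grouping {
    enum Type { NONE, GROUPING_EXPRESSION, END_EXPRESSION, NEXT_BLOCK }

    /**
     * Expands the minimal block starting at siblings.get(start).
     *
     * @param starts  indices of all minimal block starts found in phase one
     * @param matches stands in for the grouping or end XPath expression
     */
    static <T> List<T> expand(Type type, List<T> siblings, int start,
                              Set<Integer> starts, Predicate<T> matches) {
        List<T> block = new ArrayList<>();
        block.add(siblings.get(start));
        for (int i = start + 1; i < siblings.size(); i++) {
            T next = siblings.get(i);
            boolean take;
            switch (type) {
                case GROUPING_EXPRESSION: take = matches.test(next); break;
                case END_EXPRESSION:      take = !matches.test(next); break;
                case NEXT_BLOCK:          take = !starts.contains(i); break;
                default:                  take = false; // NONE: single node only
            }
            if (!take) {
                break; // the block ends before this sibling
            }
            block.add(next);
        }
        return block;
    }
}
```

NEXT_BLOCK is the only type that consults the set of all block starts, which is why the starts must be known before any block is expanded.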

For each block, Phoenix checks all rules defined by this block's type. Rules are (condition, action, rule set) 
triples, where either action or rule set or both may be set. If the condition is met by the block, the action 
(see 4.2) fires and the rule set is applied to this block's content. 

These inner rule sets optionally specify pre- and post-actions to switch the user object for the scope of this 
inner rule set. To this end, the pre-action creates and returns a new user object. If no pre-action is given, 
the rule set inherits the enclosing user object. After inner rule set processing is finished, the post-action 
post-processes the extracted information and writes the local user object back to the original user object. 
After Phoenix has finished traversing the document, the given user object is filled with the extracted information. 
The user object can then be manipulated in arbitrary ways, e.g. stored in a database. 
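
The interplay of rules, inner rule sets, and pre- and post-actions can be sketched as a small recursive procedure. All names below are hypothetical, and the sketch simplifies in one respect: the inner rule set is applied to the same block instead of re-partitioning the block's content into inner blocks.

```java
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical illustration of rule firing with scoped user objects. */
final class RuleEngineSketch {

    /** A rule is a (condition, action, rule set) triple; action or rule set may be null. */
    static final class Rule<B> {
        final Predicate<B> condition;
        final BiConsumer<B, Object> action; // receives block and current user object
        final RuleSet<B> inner;

        Rule(Predicate<B> condition, BiConsumer<B, Object> action, RuleSet<B> inner) {
            this.condition = condition;
            this.action = action;
            this.inner = inner;
        }
    }

    static final class RuleSet<B> {
        final List<Rule<B>> rules;
        final UnaryOperator<Object> pre;        // creates the local user object; may be null
        final BiConsumer<Object, Object> post;  // (local, original) write-back; may be null

        RuleSet(List<Rule<B>> rules, UnaryOperator<Object> pre, BiConsumer<Object, Object> post) {
            this.rules = rules;
            this.pre = pre;
            this.post = post;
        }
    }

    static <B> void apply(RuleSet<B> set, B block, Object userObject) {
        // Without a pre-action, the rule set inherits the enclosing user object.
        Object local = set.pre != null ? set.pre.apply(userObject) : userObject;
        for (Rule<B> rule : set.rules) {
            if (!rule.condition.test(block)) {
                continue;
            }
            if (rule.action != null) {
                rule.action.accept(block, local);
            }
            if (rule.inner != null) {
                apply(rule.inner, block, local);
            }
        }
        if (set.post != null) {
            set.post.accept(local, userObject); // write local results back
        }
    }
}
```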

4 Expansion mechanisms 

To allow Phoenix to fit into multiple environments, it provides flexible mechanisms for input content 
selection (selectors) and for writing the result representation (actions). Selectors can be used in any 
condition (rule set starting conditions, grouping expressions, and rule conditions) or inside actions. An 
action is part of a rule as defined above. 

4.1 Selectors 

Generally, selectors are referenced by their class and must be implementations of the interface 
de.knowit.phoenix.ruleEngine.Selector. This interface requires a single get-method. For a given 
block, a selector's get-method returns an org.w3c.dom.Node. This Node usually is a single node or a 
subset (as org.w3c.dom.DocumentFragment) of the block's contents. But, since one is free to implement 
arbitrary get-methods, a selector might also return information associated with the block, e.g. style 
information, or generated information, e.g. date and time. 

To provide a flexible mechanism, selectors are parameterizable if they implement the interface 
de.knowit.phoenix.ruleEngine.ParameterizedSelector. This is especially useful for the built-in 
selectors provided by Phoenix (see Table 1). While IdentitySelector and PositionSelector do not 
require parameters, XPathSelector requires an XPath expression ('xpath') and RegExpSelector requires 
a regular expression pattern ('regexp') as parameter. 



If the return value of a selector only depends upon the input block, the first computation of the return 
value can be cached to improve performance. Phoenix already supports this if a selector extends 
de.knowit.phoenix.ruleEngine.CachedSelector, overriding the handleGet-method instead of the 
get-method. To improve the handling of word processing documents, Phoenix provides convenience 
methods to read style information - either directly applied to the content or via (inherited) masters. Thus, 
selectors can return content based upon the style information associated with the content, e.g. bold text 
ending with a question mark. 
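
The caching idea can be sketched as follows. The types SketchSelector and SketchCachedSelector are hypothetical stand-ins: the exact signatures of Phoenix's Selector and CachedSelector are not given here, so the type parameter B (the block type) and the method shapes are assumptions.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import org.w3c.dom.Node;

/** Assumed shape of a selector: a single get-method from block to DOM node. */
interface SketchSelector<B> {
    Node get(B block);
}

/** Caches the result per block; subclasses implement handleGet instead of get. */
abstract class SketchCachedSelector<B> implements SketchSelector<B> {
    // Keyed by block identity: the same block object always yields the cached node.
    private final Map<B, Node> cache = new IdentityHashMap<>();

    @Override
    public final Node get(B block) {
        if (!cache.containsKey(block)) {
            cache.put(block, handleGet(block)); // computed on first access only
        }
        return cache.get(block);
    }

    /** Performs the actual selection; must depend on the block only. */
    protected abstract Node handleGet(B block);
}
```

Caching is only safe under the precondition stated above: a selector whose result depends on anything but the block, such as the current time, must not be cached.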

4.2 Actions 

Like selectors, actions are referenced by their class (implementations of 
de.knowit.phoenix.ruleEngine.Action), configured by parameters (if the action class implements 
de.knowit.phoenix.ruleEngine.ParameterizedAction), and a selector (if 
de.knowit.phoenix.ruleEngine.ActionWithSelector is implemented) as a special kind of parameter. 
If a rule is activated, the perform-method of its action is called, receiving the actual block and the user 
object as parameters. In addition, a logger is given, so that all activities can be logged using the Java 
logging API. 
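
A minimal sketch of such an action is given below. The interface SketchAction is a hypothetical stand-in, since the exact signature of Phoenix's perform-method is not reproduced here; the block is simplified to a string and the user object to a list.

```java
import java.util.List;
import java.util.logging.Logger;

/** Assumed shape of an action: receives block, user object, and a logger. */
interface SketchAction<B> {
    void perform(B block, Object userObject, Logger logger);
}

/** Example action: appends the block's trimmed text to a list used as user object. */
final class CollectTextAction implements SketchAction<String> {
    @Override
    public void perform(String block, Object userObject, Logger logger) {
        if (!(userObject instanceof List)) {
            logger.warning("unexpected user object: " + userObject);
            return;
        }
        @SuppressWarnings("unchecked")
        List<String> target = (List<String>) userObject;
        target.add(block.trim());
        logger.fine("stored block text: " + block.trim());
    }
}
```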

Besides the Trace-action (de.knowit.phoenix.actions.Trace), which writes information to standard 
output and logs it, there is a predefined action for writing extracted information to a DOM tree. We will 
focus on this in the next section. 



Class 10           Parameters   Description 
IdentitySelector   -            Returns the block's content as DocumentFragment. 
PositionSelector   -            Returns the block's position in the list of all blocks. 
XPathSelector      xpath        Returns the first node matching the XPath expression. 
RegExpSelector     regexp       Returns the DOM subtree for which the text matches the given 
                                regular expression. Text nodes might be split; structure is 
                                cloned as necessary to keep text. 

Table 1: Phoenix's built-in selectors. 

5 Store extracted information to XML 

As mentioned above, Phoenix provides the possibility to write extracted data back to XML again. In this 
way, one can transform semi-structured office documents to structured data represented in XML without 
the need to implement extensions. 

For XML data storage, an org.w3c.dom.Node must be used as user object, usually an 
org.w3c.dom.Document. Phoenix provides a rule set pre-action 
(de.knowit.phoenix.xmlUserObject.DescendNodePreAction) to locate a node, and an action 
(de.knowit.phoenix.xmlUserObject.SetNodeAction) to set a node value. Both 
actions require two parameters: 'path' and 'overwrite'. 

Path is given in a restricted subset of XPath: It denotes a sequence of nodes, separated by a '/'-character. An 
attribute node is characterized by an '@' sign and may only be the last node in a path, e.g. 
/organization/person/@id. 

DescendNodePreAction selects the node specified by path as the new user object. To this end, it creates new 
nodes if overwrite is true. Otherwise, existing nodes matching the path expression are re-used and new 
nodes are created as necessary. For DescendNodePreAction, path must end in an element reference. 
SetNodeAction also takes a selector - if no selector is given, SetNodeAction acts as if an 
IdentitySelector were given. Depending on the node type of the last path element, SetNodeAction 
behaves as follows: If the last path element is an attribute, this attribute's value is set to the text of the node 
returned by the given selector. If the last path element is an element node, the node returned by the 
selector is added as child to that node. 
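
The path handling described above can be sketched with the standard DOM API. The class PathSketch is hypothetical and does not reproduce Phoenix's DescendNodePreAction or SetNodeAction. In particular, it assumes that overwrite=true creates a fresh node only for the last path step, while parent steps are always re-used; the text above leaves this detail open.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical illustration of path-based writing into a DOM user object. */
final class PathSketch {

    /** Walks the element steps below start, creating missing elements; returns the last one. */
    static Element descend(Node start, String elementPath, boolean overwrite) {
        Document doc = start instanceof Document ? (Document) start : start.getOwnerDocument();
        List<String> steps = new ArrayList<>();
        for (String step : elementPath.split("/")) {
            if (!step.isEmpty()) steps.add(step);
        }
        Node current = start;
        for (int i = 0; i < steps.size(); i++) {
            boolean fresh = overwrite && i == steps.size() - 1; // assumption, see lead-in
            Element child = fresh ? null : firstChild(current, steps.get(i));
            if (child == null) {
                child = doc.createElement(steps.get(i));
                current.appendChild(child);
            }
            current = child;
        }
        if (!(current instanceof Element)) {
            throw new IllegalArgumentException("path must end in an element");
        }
        return (Element) current;
    }

    /** Sets an attribute value or appends a child, depending on the last path step. */
    static void set(Node start, String path, boolean overwrite, Node selected) {
        int slash = path.lastIndexOf('/');
        String last = path.substring(slash + 1);
        if (last.startsWith("@")) {
            String parents = slash < 0 ? "" : path.substring(0, slash);
            Element owner = descend(start, parents, false);
            owner.setAttribute(last.substring(1), selected.getTextContent());
        } else {
            Element target = descend(start, path, overwrite);
            // importNode copies the selected node into the target document.
            target.appendChild(target.getOwnerDocument().importNode(selected, true));
        }
    }

    private static Element firstChild(Node parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals(name)) return (Element) n;
        }
        return null;
    }
}
```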

6 Rule Set Definition 

Phoenix is usually run against many documents sharing a static common structure. Therefore, each 
application's grammar changes only slightly over time. This allows for manual grammar modeling, providing 
maximum flexibility. 

Phoenix defines rule sets by XML documents according to the Phoenix-XMLSchema 11 . Each rule set 
document provides a root node named RuleSet. Each RuleSet has an ID. Additionally, the RuleSet node 
defines necessary namespaces. 

<RuleSet ID="RS:default" 
    xmlns:office="http://openoffice.org/2000/office" 
    [...] 
    xmlns:text="http://openoffice.org/2000/text" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:noNamespaceSchemaLocation 
        ="http://ki.informatik.uni-wuerzburg.de/~betz/phoenix/phoenix.xsd"> 

11 http://ki.informatik.uni-wuerzburg.de/~betz/phoenix/ 

Furthermore, an inner RuleSet-Node may contain pre- and post-attributes, specifying the pre and post 
rule set actions by class name. 

<RuleSet ID="BlockSequence" 
    pre="de.d3web.caseParser.actions.examinations.StartCaseParagraph" 
    post="de.d3web.caseParser.actions.examinations.EndCaseParagraph"> 

Each RuleSet-Node comprises any number of block type definitions, where each Block has an ID-attribute, 
a Definition and a list of Rules. 

<Block ID="body"> 
  <Definition> 
    <Start matches="/text:p | /text:h"/> 
    <Condition type="and"> 
      <Condition type="paragraphStart"/> 
      <Condition type="exists" 
          selector="de.knowit.phoenix.selectors.RegexpSelector" 
          selectorParameters="regexp=\s*(.*)\s*:"/> 
    </Condition> 
    <Grouping type="END_EXPRESSION"> 
      <GroupingExpression 
          matches="descendant-or-self::*[contains(text(), 'Ende')]"/> 
    </Grouping> 
  </Definition> 
  <Rules> 
    <Rule> 

    </Rule> 
  </Rules> 
</Block> 

The matches attribute of Start specifies the XPath expression for the block starting node. The condition 
is given by type, selector (optional) and selectorParameters (optional). Conditions may be nested 
using the aggregation conditions (and, or, not, minmax). 

Grouping may be one of the types specified above, where NONE is expressed by omitting the Grouping tag. 
Thus, the following grouping tags are valid: 

<Grouping type="GROUPING_EXPRESSION"> 
  <GroupingExpression 
      matches="descendant-or-self::*[contains(text(), '@')]"/> 
</Grouping> 

<Grouping type="END_EXPRESSION"> 
  <GroupingExpression 
      matches="descendant-or-self::*[contains(text(), 'Ende')]"/> 
</Grouping> 

<Grouping type="NEXT_BLOCK"/> 

Like the items above, rules are attributed with an ID. They share the syntax of Condition given above. 
Each rule possesses at most one Action tag, specifying the action class, and one RuleSet. 

<Rules> 
  <Rule ID="R1"> 
    <Condition type="contains" 
        selector="de.d3web.caseParser.selectors.TitleSelector" 
        value="Definition"/> 
    <Action class="de.knowit.phoenix.xmlUserObject.SetNodeAction" 
        parameters="path=@title;overwrite=false"> 
      <Source selector="example.selectors.StartingNodeSelector"/> 
    </Action> 
    <RuleSet ID="RS:R1"> 

    </RuleSet> 
  </Rule> 
</Rules> 

7 d3web.CaseImporter 

Our main goal was to build an application ("d3web.CaseImporter 12 ") for extracting medical training 
cases from dismissal records (see [11]). With this application, we prove the concept of evolutionary 
modeling: Authors of medical training cases re-use existing documents, altering and extending content as 
needed. 

With only minor changes, a dismissal record can be transformed into a training case: An author needs to 
make sure that the document's layout matches the requirements given by CaseImporter. He usually strips 
unwanted formatting like headers and footers. Also, he ensures that headings are in the correct format: 
starting a new paragraph, boldfaced and ended by a colon. We chose the format used in most dismissal 
records, so the need for changes in the document is minimal. The heading for the list of diagnoses must be 
'Diagnosen'. The most important step in this first pass is anonymization: The author must remove any 
private data, including dates and locations. 

After the author has performed these steps, he can upload his document to d3web.CaseImporter using a web 
browser. For each case, CaseImporter provides him with a log of parsing events (indicating possible 
problems using 'traffic lights') and a dump of case contents. Also, the author can directly start his case in 
d3web.Train. 

As students' prior knowledge and learning goals require, the author extends his case. He adds texts and images 
for introduction or conclusion and multiple choice questions. He improves the presentation by adding images 
(like x-rays, smears or screenshots of lab data forms) and he adds image interpretation tasks. For 
relating observations to diagnoses, the author labels both with the same background color. 

From these documents, CaseImporter generates a structured representation based upon d3web's knowledge 
model: three terminologies (examinations, diagnoses, and therapies) are populated. Content and tasks 
are related to these terminologies. For example, a diagnosis selection task requires the learner to select 
diagnoses appropriate to a given situation from the terminology. Feedback then compares this selection to 
the list of diagnoses from the terminology given by the author, respecting even hierarchical relations. 

To implement CaseImporter, we developed appropriate selectors, actions, and a rule set. We used selectors 
basically as shortcuts to simplify rule set definition and to implement the caching mechanism outlined 
above: e.g. TitleSelector and ContentSelector separate title and content of a paragraph. Actions write 
to d3web's CaseObject and sub-parts, as inner rule sets create and select appropriate user objects (like 
CaseParagraphs). Only the ImageExtraction action extracts an image included in the document to the file 
system, clipping the picture as necessary. 

12 http://www.d3webtrain.de/author/ 

8 Conclusion 

Since Phoenix is a general-purpose tool, it is already in use in several projects: As a spin-off project from 
the case extraction engine, the Phoenix parser was integrated into the knowledge modeling environment 
KnowME to import terminology from text or document files. The knowledge bases created with KnowME 
are used either in d3web 13 applications or in the consultation system Assist 14 . 

Completely separate from our main project, Phoenix is also used to populate a juridical eLearning 
environment from Word documents. 

Future releases of Phoenix will include actions for building Semantic Web ontologies, building on the Jena 
15 framework. In this way, we will carry the evolutionary approach over to arbitrary Semantic Web 
applications, widening the modeling bottleneck. 

Experience shows that d3web.CaseImporter meets that goal for medical case-based training systems: It 
was possible to reduce the time for settling-in from months down to an hour. Also, the time for developing a 
single case was reduced, especially when compared to previous approaches that first build a complete 
diagnostic knowledge base for the domain or reuse an existing one [12]. 

This speed-up led to an increasing acceptance of case-based training systems by authors. Now, even 
inexperienced authors are able to develop high-quality cases in a reasonable amount of time, e.g. when 
preparing a lecture. Training systems built using CaseImporter are well accepted by students [9]. 
As Kraemer, co-author of an oncological system, puts it: "The d3web.Train system offers a new and great 
tool for creating a training program in a reasonable amount of time" [9]. 

13 http://www.d3web.de 
14 Assist by knowIT-Software GmbH, http://www.knowit-software.de 
15 http://jena.sourceforge.net 